A Comparative Study with Different Feature Selection For Arabic Text Categorization

Author

  • Meryeme Hadni
Abstract

Feature selection benefits a learner by eliminating non-informative or noisy features and by reducing the overall feature space to a manageable size. In machine learning, the term feature selection refers to the process of selecting a subset of the features used to represent the text. In this paper, we propose a new approach to text representation that incorporates background knowledge from Arabic WordNet together with feature selection techniques. Five methods were evaluated: term selection based on document frequency (DF), Information Gain (IG), Mutual Information (MI), Chi-Square (CHI), and CHIR. To evaluate its accuracy, the proposed system was trained and tested on the CCA corpus. We found the Chi-Square and CHIR statistics most effective in our experiments.

Keywords: Feature selection; Machine Learning; Text Representation.
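To make two of the evaluated criteria concrete, the following is a minimal Python sketch of document frequency (DF) and the Chi-Square (CHI) statistic. It uses the standard 2x2 contingency-table form of chi-square for a term and a category; the abstract does not give the paper's exact formulation, so this should be read as an assumption. The toy documents, labels, and the helper names `document_frequency`, `chi_square`, and `select_top_k` are hypothetical and are not taken from the paper or from the CCA corpus.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def chi_square(docs, labels, term, category):
    """Chi-square statistic for one (term, category) pair.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = len(docs)
    a = b = c = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def select_top_k(docs, labels, k):
    """Rank terms by their maximum chi-square over all categories; keep k."""
    vocab = set().union(*docs)
    cats = set(labels)
    scores = {
        t: max(chi_square(docs, labels, t, cat) for cat in cats) for t in vocab
    }
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a set of tokens.
toy_docs = [
    {"goal", "match"},
    {"match", "team"},
    {"bank", "market"},
    {"market", "price"},
]
toy_labels = ["sport", "sport", "economy", "economy"]

if __name__ == "__main__":
    print(document_frequency(toy_docs))
    print(select_top_k(toy_docs, toy_labels, 2))
```

In this sketch, DF keeps terms purely by how many documents contain them, while chi-square favours terms whose presence is strongly associated with one category. Taking the maximum score over categories is one common way to obtain a single score per term; averaging over categories is another.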


Similar resources

Support Vector Machines based Arabic Language Text Classification System: Feature Selection Comparative Study

Feature selection is essential for effective and accurate text classification systems. This paper investigates the effectiveness of six commonly used feature selection methods. Evaluation used an in-house collected Arabic text classification corpus, and classification is based on a Support Vector Machine classifier. The experimental results are presented in terms of precision, recall and Macroave...


New stemming for arabic text classification using feature selection and decision trees

In this paper we conduct a comparative study between two stemming algorithms, the Khoja stemmer and our new stemmer, for Arabic text classification (categorization), using Chi-square statistics for feature selection and focusing on a decision tree classifier. Evaluation used a corpus that consists of 5070 documents independently classified into six categories: sport, entertainment, business, middle east...


Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to the high-dimensional feature space. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...


A Study of Text Preprocessing Tools for Arabic Text Categorization

Text preprocessing is an essential stage in text categorization (TC) in particular and in text mining in general. Morphological tools can be used in text preprocessing to reduce multiple forms of a word to one form. There has been a debate among researchers about the benefits of using morphological tools in TC. Studies in the English language illustrated that performing stemming during the preproc...


A Hybrid Feature Selection Approach for Arabic Documents Classification

Text categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. Text categorization algorithms usually represent documents as bags of words and consequently have to deal with a huge number of features. Feature selection tries to find a set of relevant terms to improve both efficiency and generalization. There are two main ap...




Publication date: 2014